[TRTLLM-5971][feat] Integrate helix parallelism #9342

brb-nv · 2025-11-20T20:41:18Z

Description

This MR integrates helix parallelism, an experimental feature, in TRTLLM.

Background:

Helix parallelism is a decode-only context parallelism method. Hence, it is used in a disaggregated setting where only the gen servers use helix.
It shards each request's sequence (seqlen) across multiple CP (context parallel) ranks.
For a given query token in the decode phase, “local attention” is computed with respect to the previous tokens held on each CP rank.
Subsequent communication among the CP ranks “corrects” the local attention results so that the overall attention computation is exact.
Since KV parallelism applies only to the attention layer, the CP GPUs are "repurposed" as TP GPUs for the FFN layer.
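
The description does not spell out how the “correction” works. One standard way to combine partial attention results exactly is for each rank to return its locally normalized output together with the log-sum-exp of its local scores, and then to reweight the outputs by each rank's share of the global softmax mass. The pure-Python sketch below assumes that scheme; it illustrates why an exact result is possible and is not a description of the TRTLLM kernels. The function names and the 4 + 4 split are made up for the illustration.

```python
import math
import random

def local_attention(q, keys, values):
    """Softmax attention of one query over a local KV shard.

    Returns the locally normalized output and the log-sum-exp (LSE) of the
    scaled scores, which is the statistic needed for the later correction.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(len(values[0]))]
    return out, m + math.log(denom)

def merge_partials(partials):
    """Weight each rank's output by its share of the global softmax mass."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    global_lse = m + math.log(sum(math.exp(l - m) for l in lses))
    dim = len(partials[0][0])
    return [sum(math.exp(lse - global_lse) * out[d] for out, lse in partials)
            for d in range(dim)]

random.seed(0)
head_dim, num_kv = 16, 8
q = [random.gauss(0, 1) for _ in range(head_dim)]
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(num_kv)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(num_kv)]

# Hypothetical two-rank split of the KV entries: 4 on cpRank0, 4 on cpRank1.
partials = [local_attention(q, keys[:4], values[:4]),
            local_attention(q, keys[4:], values[4:])]
merged = merge_partials(partials)

# Single-device reference over all KV entries.
reference, _ = local_attention(q, keys, values)
max_abs_err = max(abs(a - b) for a, b in zip(merged, reference))
```

Under these assumptions the merged result differs from the single-device reference only by floating-point rounding, which is the sense in which the corrected attention is "exact".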

Changes in this MR:

At a broader level, we enable helix parallelism with DeepseekV3 and add a disagg integration test (a smoke test for now).
Example to explain the core changes:
- Suppose we are at the first decode step for a request with ISL 7, and the gen server has two-way context parallelism, i.e. cpSize=2.
- Say the first 4 tokens reside on cpRank0 and the next 3 tokens reside on cpRank1.
- We have an incoming query token, q7 (corresponding to the first generated token). Local attention for q7 is computed on both CP ranks, but its KV cache entry (kv7) is written to only one CP rank (rank 1 in this example), and kv7 is included in local attention only on that rank. We call this rank the "active helix rank".

Known limitation: currently only the last CP rank is treated as the active rank. This will be lifted in a follow-up MR.
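
The example can be traced with a small sketch. The helper names are hypothetical, and the contiguous split with the remainder going to earlier ranks is an assumption chosen only because it reproduces the 4 + 3 split described above; the real code may partition tokens differently.

```python
def shard_prompt(isl, cp_size):
    """Contiguous split of prompt token indices across CP ranks.

    Assumption: earlier ranks take the remainder, which gives 4 + 3 for
    ISL 7 and cp_size 2 as in the example.
    """
    base, rem = divmod(isl, cp_size)
    shards, start = [], 0
    for rank in range(cp_size):
        length = base + (1 if rank < rem else 0)
        shards.append(list(range(start, start + length)))
        start += length
    return shards

def decode_step(shards, next_token_idx):
    """Append the new token's KV entry only on the active (last) CP rank."""
    active_rank = len(shards) - 1
    shards[active_rank].append(next_token_idx)
    return active_rank

shards = shard_prompt(isl=7, cp_size=2)         # tokens 0-3 on rank 0, 4-6 on rank 1
active = decode_step(shards, next_token_idx=7)  # kv7 lands on rank 1 only
kv_lens = [len(s) for s in shards]              # KV entries each rank attends over
```

After the first decode step, rank 0 still attends over its 4 prompt tokens while rank 1 attends over its 3 prompt tokens plus kv7.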

Most changes in this MR enforce this:

- KV cache for the query token is added only on the active rank, in resource_manager.py.
- The actual KV cache write happens in the MLA RoPE kernels; the kernel changes skip the write on inactive ranks.
- The number of tokens considered in local attention is determined by seq_len_kv in trtllm.py, which is adjusted accordingly.

"Repurposing" attn CP ranks to FFN TP ranks can make things quite messy. To keep this readable,

- We pass a mapping with CP only to the attention layers in modeling_deepseekv3.py and a mapping without CP to everything else.
- We use a similar trick in communicator.py to obtain the right TP groups.
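
A minimal sketch of the "mapping without CP" idea, assuming the repurposed view simply folds the CP dimension into TP (tp_size becomes tp_size * cp_size, cp_size becomes 1). `ToyMapping` and `repurpose_cp_to_tp` are hypothetical stand-ins and model sizes only; rank assignment in the real Mapping class is not covered here.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToyMapping:
    """Hypothetical stand-in for the real Mapping class; sizes only."""
    tp_size: int
    cp_size: int
    pp_size: int = 1

    @property
    def world_size(self):
        return self.tp_size * self.cp_size * self.pp_size

def repurpose_cp_to_tp(mapping):
    """Return a CP-free view in which CP ranks are folded into TP."""
    return replace(mapping,
                   tp_size=mapping.tp_size * mapping.cp_size,
                   cp_size=1)

# Gen server from the test config name (gentp1cp2): TP 1, CP 2.
mapping_with_cp = ToyMapping(tp_size=1, cp_size=2)
mapping_without_cp = repurpose_cp_to_tp(mapping_with_cp)
```

Returning a new object instead of mutating the original keeps the CP-aware mapping available for the attention layers while the FFN layers see the TP-only view.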

Test Coverage

$ pytest tests/unittest/_torch/modules/test_mla_helix.py -s -v
$ TRTLLM_USE_UCX_KVCACHE=1 TLLM_LOG_LEVEL=INFO pytest tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix -s -v

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

Release Notes

New Features
- Added context parallelism support with Helix-based distributed inference capabilities
- DeepSeekV3 model now supports context parallelism for enhanced performance on multi-GPU setups
- New --cp_size command-line argument for configuring context parallel size (default: 1)
- Enhanced disaggregated serving configuration for context-tensor parallel distribution
Tests
- Added new test configuration for disaggregated DeepSeekV3 inference with context parallelism


tensorrt_llm/_torch/pyexecutor/executor_request_queue.py

coderabbitai · 2025-11-21T18:05:07Z

Walkthrough

This pull request implements context parallelism support with Helix configuration across the TensorRT-LLM inference stack. It adds per-rank inactivity tracking (helix_is_inactive_rank) to CUDA kernels and Python layers, introduces CP size configuration parameters, implements mapping repurposing logic for CP/TP distribution, and extends model initialization and executor logic to handle inactive Helix ranks during generation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CUDA Kernel Signatures `cpp/tensorrt_llm/kernels/mlaKernels.cu`, `cpp/tensorrt_llm/kernels/mlaKernels.h` | Added `helix_is_inactive_rank` boolean pointer parameter to MLA rope generation kernel signatures; threaded through kernel invocations to gate token processing and K/V updates based on rank inactivity status. |
| Tensor Operations & Rope Generation `cpp/tensorrt_llm/thop/attentionOp.cpp`, `cpp/tensorrt_llm/thop/dsv3RopeOp.cpp` | Extended MLA tensor parameter handling to expect and forward two tensors (helix_position_offsets, helix_is_inactive_rank); added new field to MlaRopeGenArgs struct and propagated inactive rank mask through rope generation pipelines. |
| Torch Attention Backend `tensorrt_llm/_torch/attention_backend/trtllm.py` | Added `helix_position_offsets` and `helix_is_inactive_rank` to plan/forward/mla_rope_generation APIs; extended TrtllmAttentionMetadata with inactive rank tracking; adjusted KV length planning to exclude inactive rank contributions. |
| Distributed Communication `tensorrt_llm/_torch/distributed/communicator.py` | Implemented early CP communicator creation and mapping repurposing logic: when cp_size > 1, creates a copy with Helix mapping, scales TP by CP size, and restores original mapping after TP/PP communicator initialization. |
| Model Architecture `tensorrt_llm/_torch/models/modeling_deepseekv3.py` | Extended DeepseekV3 layer constructors with optional `mapping_with_cp` parameter; added CP rank/size extraction and weight-split logic for KV projection; implemented mapping repurposing during model initialization for cp_size > 1. |
| Attention Modules `tensorrt_llm/_torch/modules/attention.py` | Added `mapping_with_cp` parameter to MLA and Attention constructors; enforced num_heads equality and Helix CP type validation; updated forward paths to propagate helix parameters and support position_ids threading. |
| Executor & Resource Management `tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`, `tensorrt_llm/_torch/pyexecutor/llm_request.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`, `tensorrt_llm/_torch/pyexecutor/resource_manager.py` | Added `py_helix_is_inactive_rank` flag to LlmRequest; implemented helix inactive rank tracking in model engine with conditional position/token calculations; gated KV cache allocation for inactive ranks in resource manager; extended AttentionMetadata with inactive rank exposure. |
| CLI & Configuration `examples/llm-api/quickstart_advanced.py`, `tensorrt_llm/commands/serve.py` | Added `--cp_size` and `cp_config` command-line arguments; propagated context_parallel_size through LLM initialization; implemented cp_type string-to-enum conversion with validation. |
| Infrastructure & Mapping `tensorrt_llm/llmapi/disagg_utils.py`, `tensorrt_llm/mapping.py` | Updated instance rank calculation to include context_parallel_size; added hardcoded Helix CP type fallback when cp_size > 1 to override externally provided cp_config. |
| Test Infrastructure `tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml`, `tests/integration/defs/disaggregated/test_disaggregated.py` | Added new disaggregated test configuration file for context TP and generation Helix setup; introduced test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix test case with model symlink setup. |

Sequence Diagram(s)

sequenceDiagram
    participant Request
    participant ResourceMgr as Resource<br/>Manager
    participant ModelEngine
    participant AttentionBE as Attention<br/>Backend
    participant MLAKernel as MLA<br/>Kernel

    Request->>ResourceMgr: prepare_resources()
    activate ResourceMgr
    alt cp_size > 1 and not last rank
        ResourceMgr->>ResourceMgr: mark py_helix_is_inactive_rank=true
        ResourceMgr->>ResourceMgr: skip KV cache allocation
    else active rank
        ResourceMgr->>ResourceMgr: allocate KV cache normally
    end
    deactivate ResourceMgr

    Request->>ModelEngine: forward pass (generation)
    activate ModelEngine
    alt helix_is_inactive_rank[batch]==true
        ModelEngine->>ModelEngine: fix past_seen_token_num<br/>(no increment)
        ModelEngine->>ModelEngine: skip token processing
    else active
        ModelEngine->>ModelEngine: increment past_seen_token_num
        ModelEngine->>AttentionBE: plan() with helix params
    end
    deactivate ModelEngine

    AttentionBE->>AttentionBE: adjust kv_lens planning<br/>(exclude inactive ranks)
    AttentionBE->>MLAKernel: mla_rope_generation<br/>(helix_is_inactive_rank)
    activate MLAKernel
    alt helix_is_inactive_rank[batch]==true
        MLAKernel->>MLAKernel: skip token processing
        MLAKernel->>MLAKernel: skip K/V updates
    else active
        MLAKernel->>MLAKernel: apply rope & assign QKV
        MLAKernel->>MLAKernel: update K/V cache
    end
    deactivate MLAKernel

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

Mapping repurposing logic (communicator.py, modeling_deepseekv3.py, mapping.py): Core logic for switching between CP and TP distributions; mutations and restorations must be correctly sequenced and scoped to avoid state leaks.
KV length planning adjustments (trtllm.py, model_engine.py): Changes to how KV cache lengths are calculated when inactive ranks are present; verify accounting is correct for all rank states.
Warmup control flow (model_engine.py): Conditional position_id and past_seen_token_num calculations based on warmup state and inactivity; ensure all branches are consistent.
Cross-layer parameter threading (executor_request_queue.py, model_engine.py, resource_manager.py): helix_is_inactive_rank flows through multiple abstraction layers; verify end-to-end propagation and type conversions (bool → tensor → pointer).
Model initialization side effects (modeling_deepseekv3.py): Temporary mapping mutations during model construction; verify original mapping is reliably restored even on error paths.

Suggested reviewers

schetlur-nv
nvchenghaoz
Shixiaowei02
Superjomn
Tabrizian
Funatiq
QiJune

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 34.04% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description provides a comprehensive explanation of helix parallelism, background context, specific implementation details with an example, test coverage commands, and confirmation of the PR checklist. |
| Title check | ✅ Passed | The PR title '[TRTLLM-5971][feat] Integrate helix parallelism' clearly and specifically describes the main change: integration of helix parallelism into TensorRT-LLM. |


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

tensorrt_llm/mapping.py (1)
455-491: Don’t silently override cp_config to HELIX for any cp_size > 1

This block in Mapping.__init__:
#################################################################
# TODO: Remove this hardcoding and obtain cp_config from llm_args.
if cp_size > 1:
    cp_config = {"cp_type": CpType.HELIX}
#################################################################
has broad side effects:

Any caller that provides a non-Helix cp_config (e.g. STAR or ULYSSES) with cp_size > 1 now gets that configuration silently discarded and treated as HELIX.

Code that branches on cp_config["cp_type"] (e.g. _merge_requests in executor_request_queue.py, STAR attention paths, etc.) will never see CpType.STAR/ULYSSES once cp_size > 1, effectively breaking those CP modes.

Additional cp_config fields (like STAR’s block_size / cp_anchor_size, or future Helix parameters) are lost.

If the intent is “for now we only support Helix when cp_size > 1”, it’s safer to:

Only inject a default when cp_config is missing; and

Fail fast on conflicting configs instead of overriding them:
# Temporary default until cp_config is fully plumbed from llm_args.
if cp_size > 1:
    if cp_config is None:
        cp_config = {"cp_type": CpType.HELIX}
    elif cp_config.get("cp_type") != CpType.HELIX:
        raise ValueError(
            f"Only CpType.HELIX is currently supported when cp_size > 1; got {cp_config.get('cp_type')!r}"
        )
That keeps Helix as the only supported multi-CP mode in this PR, but avoids surprising behavior for existing STAR/ULYSSES configs and makes future extension to other CP types straightforward.
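
For reference, the suggested guard can be exercised in isolation. The sketch below wraps it in a hypothetical helper, `resolve_cp_config`, and uses a stand-in `CpType` enum containing only the members named in this review; neither is the real code in mapping.py.

```python
from enum import Enum, auto

class CpType(Enum):
    """Stand-in enum with only the members mentioned in this review."""
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()

def resolve_cp_config(cp_size, cp_config=None):
    """Default to Helix when no config is given; reject conflicting configs."""
    if cp_size > 1:
        if cp_config is None:
            return {"cp_type": CpType.HELIX}
        if cp_config.get("cp_type") != CpType.HELIX:
            raise ValueError(
                "Only CpType.HELIX is currently supported when cp_size > 1; "
                f"got {cp_config.get('cp_type')!r}")
    return cp_config

default_cfg = resolve_cp_config(cp_size=2)
helix_cfg = resolve_cp_config(2, {"cp_type": CpType.HELIX, "extra": 1})
try:
    resolve_cp_config(2, {"cp_type": CpType.STAR})
    star_rejected = False
except ValueError:
    star_rejected = True
```

An explicit Helix config passes through with its extra fields intact, a missing config gets the Helix default, and a STAR config raises instead of being silently rewritten.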
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1568-1623: Tighten helix_is_inactive_rank initialization guard; verify warmup dummy request semantics for Helix

The new Helix logic is mostly sound, but there is one definite initialization bug and one edge case to confirm:
helix_is_inactive_rank initialization guard is incorrect

The current initialization at line 1568:
helix_is_inactive_rank = [] if self.mapping.cp_size > 1 else None
initializes to an empty list for all CP configurations with cp_size > 1, but has_cp_helix() returns True only when both cp_size > 1 and cp_type == CpType.HELIX. For non-Helix CP types (e.g., regular CP or other variants), this creates an empty list that never gets populated, diverging from the None state that downstream consumers expect when Helix is disabled.

Fix: Change line 1568 to:
helix_is_inactive_rank = [] if self.mapping.has_cp_helix() else None
Warmup + Helix: past_seen_token_num override semantics need verification

During warmup, you correctly skip the position_id computation (line 1605), but past_seen_token_num is unconditionally overridden based on request.orig_prompt_len (lines 1608–1612) whenever Helix is active. This is fed into num_cached_tokens_per_seq, which becomes part of KVCacheParams. For dummy warmup requests, ensure:

orig_prompt_len is consistently initialized for all dummy request types created during warmup, and

the resulting KV cache index values remain within valid bounds on inactive Helix ranks.

Per-request inactivity flag wiring looks correct

The per-beam append pattern (lines 1572–1619) produces a helix_is_inactive_rank list with length equal to the total batch size (sum of beam widths), which matches the attention backend's [batch_size] expectation.
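
The two points above (a `None` result unless Helix is enabled, and one flag per beam) can be checked with a small sketch. `ToyRequest` and `build_inactive_flags` are hypothetical and only model the fields discussed here, not the actual loop in model_engine.py.

```python
from dataclasses import dataclass

@dataclass
class ToyRequest:
    """Hypothetical request carrying only the fields this sketch needs."""
    beam_width: int
    py_helix_is_inactive_rank: bool = False

def build_inactive_flags(requests, has_cp_helix):
    """Return one flag per beam, or None when Helix is not enabled."""
    if not has_cp_helix:
        return None
    flags = []
    for req in requests:
        flags.extend([req.py_helix_is_inactive_rank] * req.beam_width)
    return flags

requests = [ToyRequest(beam_width=1, py_helix_is_inactive_rank=True),
            ToyRequest(beam_width=3, py_helix_is_inactive_rank=True)]
helix_flags = build_inactive_flags(requests, has_cp_helix=True)
non_helix_flags = build_inactive_flags(requests, has_cp_helix=False)
```

With beam widths 1 and 3 the Helix case yields four flags, matching a batch dimension equal to the sum of beam widths, while a non-Helix CP mode gets `None` instead of an empty list.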
2526-2537: Behavioral inconsistency confirmed: ULYSSES passes warmup checks but fails at runtime

The review concern is valid. I found that the change introduces a systemic breaking behavior across three PyExecutor methods:

model_engine.py._prepare_inputs (line 2536): Raises for non-STAR/HELIX

executor_request_queue.py._merge_requests (line 725): Raises for non-STAR/HELIX

py_executor.py._update_request_states (line 2072): Raises for non-STAR/HELIX

The critical inconsistency:

Warmup check (model_engine.py line 564) accepts ULYSSES and returns early

Runtime execution (line 2536) raises NotImplementedError if ULYSSES reaches _prepare_inputs

This means that if someone configures PyExecutor with cp_type=ULYSSES, it will pass initialization but crash during inference.

ULYSSES is defined in the CpType enum and explicitly referenced at line 564, indicating it was intended to be handled. However, no fallback path exists in the three runtime dispatch methods, and no test coverage was found for ULYSSES with PyExecutor. The previous behavior would have silently fallen through to the default _prepare_tp_inputs path.

While there's no evidence that existing code uses ULYSSES with PyExecutor, the enum inclusion and warmup-time acceptance create an expectation of support that the runtime contradicts.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
1446-1451: Fix TP sharding after restoring the original mapping

During DeepseekV3ForCausalLM.__init__ we repurpose CP ranks into TP by installing a temporary Mapping (tp_size = tp * cp). All decoder/MTP modules capture that object via self.mapping. Later we restore model_config.mapping back to the original CP-aware mapping. Here in DeepseekV3MTP.forward, the chunking uses self.model_config.mapping.tp_size/tp_rank, which now point to the restored mapping and no longer match the row-parallel tensors created with the repurposed mapping. On Helix runs (cp_size > 1) this leaves each rank feeding the wrong slice (or no slice) into eh_proj, breaking generation.

Use the same mapping object that the layer captured during init. A minimal fix:
-        tp_size = self.model_config.mapping.tp_size
-        tp_rank = self.model_config.mapping.tp_rank
+        tp_size = self.mapping.tp_size
+        tp_rank = self.mapping.tp_rank
That keeps the MTP sharding consistent with the repurposed TP groups.
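
To make the mismatch concrete, the sketch below compares the hidden-dimension slice implied by the two mappings. The sizes and ranks are hypothetical (hidden size 8, original tp=1 and cp=2, so the repurposed mapping has tp=2), and `shard_bounds` is an illustrative helper, not a function from the codebase.

```python
def shard_bounds(hidden_size, tp_size, tp_rank):
    """[start, end) of the hidden slice a row-parallel layer expects."""
    chunk = hidden_size // tp_size
    return tp_rank * chunk, (tp_rank + 1) * chunk

hidden_size = 8

# Mapping captured by the layer at init: CP folded into TP (tp=1, cp=2 -> tp=2).
layer_tp_size, layer_tp_rank = 2, 1

# Mapping restored on model_config afterwards: the original tp=1.
restored_tp_size, restored_tp_rank = 1, 0

expected = shard_bounds(hidden_size, layer_tp_size, layer_tp_rank)
actual = shard_bounds(hidden_size, restored_tp_size, restored_tp_rank)
```

Here the layer was built for the slice [4, 8) but the restored mapping selects [0, 8), which is the kind of inconsistency the suggested change avoids by reading tp_size and tp_rank from the mapping the layer captured.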

♻️ Duplicate comments (1)

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
645-705: Helix merge: avoid hardcoded tokens_per_block and ensure total_input_len_cp is available on children

Two points in the Helix path:

Hardcoded tokens_per_block=32
elif cp_type == CpType.HELIX:
    return self._merge_helix_requests(
        new_requests,
        tokens_per_block=32)
        # tokens_per_block=cp_config['tokens_per_block'])
This ignores any configured Helix block size (e.g. via cp_config['tokens_per_block'] or KV cache config) and makes the behavior fragile if someone changes the configured block size away from 32.

It also repeats a TODO you already noted to remove this hardcoding.

Suggestion:

Prefer pulling from config with a safe default + assert, e.g.:
tokens_per_block = cp_config.get('tokens_per_block', 32)
assert tokens_per_block > 0
return self._merge_helix_requests(new_requests, tokens_per_block=tokens_per_block)
or at minimum assert that a configured value, if present, matches 32 so misconfigurations fail loudly instead of silently diverging.

total_input_len_cp not propagated to child requests
req = executor_request_to_llm_request(...)
req.total_input_len_cp = input_len
req_with_children.append(req)
if req.child_requests:
    req_with_children.extend(req.child_requests)
executor_request_to_llm_request creates child requests via LlmRequest.create_child_request, which only copies attributes whose names start with py_.

As a result, total_input_len_cp exists only on the parent; any downstream code that expects this attribute on every LlmRequest (including children when num_return_sequences > 1) will not find it.

Possible fix:

Either rename to follow the py_ convention so it’s auto-copied:
req.py_total_input_len_cp = input_len
for child in req.child_requests:
    child.py_total_input_len_cp = input_len
Or, if you deliberately want a non-py_ attribute, explicitly set it on children in this loop.

This will keep Helix metadata consistent across parent and child requests and future-proof the code against differing tokens_per_block configs.
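
The copy rule described above can be illustrated with a toy class. `ToyRequest` is a hypothetical stand-in that models only the "copy attributes whose names start with py_" behavior attributed to LlmRequest.create_child_request in this comment; it is not the real class.

```python
class ToyRequest:
    """Hypothetical stand-in that models only the py_ copy rule."""

    def __init__(self):
        self.child_requests = []

    def create_child_request(self):
        child = ToyRequest()
        # Only attributes following the py_ naming convention are copied.
        for name, value in vars(self).items():
            if name.startswith("py_"):
                setattr(child, name, value)
        self.child_requests.append(child)
        return child

parent = ToyRequest()
parent.py_total_input_len_cp = 7   # follows the convention, gets copied
parent.total_input_len_cp = 7      # does not, stays on the parent only
child = parent.create_child_request()
```

Note that the copy happens at child creation time: an attribute assigned to the parent after its children already exist still has to be set on each child explicitly, whatever its name.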
#!/bin/bash
# Check how Helix-related fields are used so they stay consistent.
rg -n "total_input_len_cp" -C3
rg -n "tokens_per_block" tensorrt_llm/_torch/pyexecutor -C3
Also applies to: 710-723

🧹 Nitpick comments (5)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

440-449: Decode-time KV allocation correctly gated on active Helix rank

Marking req.py_helix_is_inactive_rank on non-last CP ranks when has_cp_helix() and skipping add_token there ensures only the active Helix rank allocates decode-time KV cache, which matches the design.

You might consider using mapping.is_last_cp_rank() (and/or setting this flag once at request construction) for slightly clearer intent, but the current logic is functionally sound.
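
A small simulation of this gating, assuming the rule stated in the PR description (every CP rank except the last is inactive). The helper names are hypothetical and the per-rank KV lengths stand in for the real KV cache manager state.

```python
def is_inactive_helix_rank(cp_rank, cp_size, has_cp_helix):
    """Current rule per the PR: every CP rank except the last is inactive."""
    return has_cp_helix and cp_rank != cp_size - 1

def run_decode_steps(kv_lens, num_steps):
    """Grow per-rank KV lengths; only the active Helix rank adds a token."""
    cp_size = len(kv_lens)
    kv_lens = list(kv_lens)
    for _ in range(num_steps):
        for cp_rank in range(cp_size):
            if not is_inactive_helix_rank(cp_rank, cp_size, has_cp_helix=True):
                kv_lens[cp_rank] += 1   # stands in for add_token on this rank
    return kv_lens

# ISL 7 split as 4 + 3, followed by three decode steps.
after = run_decode_steps([4, 3], num_steps=3)
```

Rank 0 stays at 4 cached tokens while rank 1 grows from 3 to 6, so all decode-time KV growth lands on the last CP rank under the current limitation.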

examples/llm-api/quickstart_advanced.py (1)

71-76: cp_size flag and context_parallel_size wiring are consistent

The new --cp_size argument and its use as context_parallel_size=args.cp_size in the LLM constructor align with the new CP plumbing. The change is self-contained and doesn’t affect existing callers.

Optionally, you might extend the help string for --cp_size to mention that multi-CP currently implies Helix in this flow, so users know what they’re opting into.

Also applies to: 261-264
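
A sketch of what the extended help string could look like. Only the `--cp_size` flag, its default of 1, and the `context_parallel_size` keyword come from this PR; the parser wrapper and the help wording are illustrative, not the actual code in quickstart_advanced.py.

```python
import argparse

def build_parser():
    """Illustrative parser containing only the flag discussed here."""
    parser = argparse.ArgumentParser(description="cp_size example")
    parser.add_argument(
        "--cp_size",
        type=int,
        default=1,
        help="Context parallel size. Values greater than 1 currently "
             "imply Helix parallelism (experimental).")
    return parser

args = build_parser().parse_args(["--cp_size", "2"])
default_args = build_parser().parse_args([])

# Keyword that would be forwarded to the LLM constructor.
llm_kwargs = {"context_parallel_size": args.cp_size}
```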

tests/integration/defs/disaggregated/test_disaggregated.py (1)

154-274: New DeepSeek V3 Lite bf16 Helix disaggregated test wiring looks consistent

The new config entry and test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix follow the same symlink + run_disaggregated_test pattern as the existing DeepSeek tests, and the test_desc string matches the key added to config_map, so the wiring looks correct.

If you want to silence Ruff’s ARG001 warning, you could rename disaggregated_test_root to _disaggregated_test_root in the new test (or add a # noqa: ARG001), but that’s cosmetic and consistent with the rest of this file.

Also applies to: 1915-1933
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
12-13: Remove duplicate LlmRequest import

LlmRequest is imported twice in this file (here and again at line 62). You can safely drop the earlier import and keep the one that also brings in get_draft_token_length:
-from .llm_request import LlmRequest
-
 import torch
This keeps imports minimal without changing behavior.
tensorrt_llm/commands/serve.py (1)
5-5: Drop the duplicate gc import
Line 2 already imports gc, so this second import triggers Ruff F811 (Redefinition of unused gc). Please drop the extra line to keep lint happy.
-import gc

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2abb8 and 50436a1.

📒 Files selected for processing (18)

cpp/tensorrt_llm/kernels/mlaKernels.cu (4 hunks)
cpp/tensorrt_llm/kernels/mlaKernels.h (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (2 hunks)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (6 hunks)
examples/llm-api/quickstart_advanced.py (2 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (9 hunks)
tensorrt_llm/_torch/distributed/communicator.py (2 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
tensorrt_llm/_torch/modules/attention.py (10 hunks)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3 hunks)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
tensorrt_llm/commands/serve.py (7 hunks)
tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
tensorrt_llm/mapping.py (1 hunks)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)

🧰 Additional context used

🧠 Learnings (27)

📓 Common learnings

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

📚 Learning: 2025-08-14T15:43:23.107Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/modules/attention.py

📚 Learning: 2025-08-14T15:38:01.771Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-09-29T15:14:28.503Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-09-29T15:14:28.503Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
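
The tp_size multiplication described in the learning above can be sketched as follows. This is only an illustration, not the code in lora_manager.py: the helper name `fused_qkv_part_sizes` and the `head_size` parameter are hypothetical, and the head counts are assumed to be per-TP-rank values as the learning states.

```python
def fused_qkv_part_sizes(num_heads_per_rank: int,
                         num_kv_heads_per_rank: int,
                         head_size: int,
                         tp_size: int) -> list[int]:
    """Return the full (unsharded) q, k, v sizes of a fused QKV module.

    Hypothetical helper: the head counts are assumed to be already divided
    by tp_size, so multiplying by tp_size recovers the original dimension.
    """
    q_size = num_heads_per_rank * head_size * tp_size
    kv_size = num_kv_heads_per_rank * head_size * tp_size
    sizes = [q_size, kv_size, kv_size]
    # Each part must split evenly across TP ranks.
    assert all(size % tp_size == 0 for size in sizes)
    return sizes


# 8 query heads and 2 KV heads per rank, head size 128, 4-way TP.
sizes = fused_qkv_part_sizes(8, 2, 128, 4)
```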

📚 Learning: 2025-08-14T21:04:50.248Z

Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-08-15T06:46:53.813Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-08-09T20:57:04.084Z

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp

📚 Learning: 2025-08-14T06:36:40.701Z

Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-08-26T06:07:02.166Z

Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-09-23T14:58:05.372Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
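
A minimal sketch of the ceil-like partitioning described in the learning above, written in Python for illustration rather than as the CUDA kernel code. The function name `partition_tokens` is hypothetical; clamping to `num_tokens` stands in for the runtime bounds check the learning mentions.

```python
def partition_tokens(num_tokens: int, world_size: int):
    """Give every rank the same tokens_per_rank (ceil division).

    Later ranks may be assigned fewer real tokens, or none; the clamp
    below plays the role of the kernel's bounds check.
    """
    tokens_per_rank = -(-num_tokens // world_size)  # ceil(num_tokens / world_size)
    ranges = []
    for rank in range(world_size):
        start = min(rank * tokens_per_rank, num_tokens)
        end = min(start + tokens_per_rank, num_tokens)
        ranges.append((start, end))
    return tokens_per_rank, ranges


per_rank, ranges = partition_tokens(5, 4)
```

With 5 tokens on 4 ranks, every rank is sized for 2 tokens, and the last rank ends up with an empty range.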

📚 Learning: 2025-08-19T12:45:11.997Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-09-02T13:42:44.885Z

Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

tensorrt_llm/_torch/distributed/communicator.py
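
One way to keep tag namespaces disjoint, as the learning above requires, is to give each communication type its own base offset. The sketch below is an assumption for illustration only: the names `TagBase` and `make_tag` and the width of 10,000 are hypothetical and do not come from py_executor.py. The width is chosen so that the largest tag stays below 32767, the minimum upper bound the MPI standard guarantees for tags.

```python
from enum import IntEnum

TAG_WIDTH = 10_000  # hypothetical size of each tag namespace


class TagBase(IntEnum):
    # Each communication type owns the range [base, base + TAG_WIDTH).
    TOKENS = 0 * TAG_WIDTH
    LOGITS = 1 * TAG_WIDTH
    TERMINATION = 2 * TAG_WIDTH


def make_tag(base: TagBase, message_id: int) -> int:
    """Map a per-type message id into that type's private tag range."""
    if not 0 <= message_id < TAG_WIDTH:
        raise ValueError(f"message_id {message_id} is outside [0, {TAG_WIDTH})")
    return int(base) + message_id


tags = {make_tag(base, 7) for base in TagBase}
```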

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-09-09T09:40:45.658Z

Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py

📚 Learning: 2025-08-01T15:14:45.673Z

Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py

📚 Learning: 2025-08-08T22:03:40.707Z

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-09-23T15:01:00.070Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-08-21T02:39:12.009Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-08-15T06:46:54.897Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-09-23T15:12:38.312Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

tensorrt_llm/mapping.py

📚 Learning: 2025-08-21T09:41:49.347Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-08-20T06:48:45.368Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-08-06T13:58:07.506Z

Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-08-26T09:37:10.463Z

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

tensorrt_llm/commands/serve.py

📚 Learning: 2025-08-26T09:37:10.463Z

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

tensorrt_llm/commands/serve.py

📚 Learning: 2025-08-27T14:23:55.566Z

Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

tensorrt_llm/_torch/modules/attention.py

🧬 Code graph analysis (12)

tensorrt_llm/_torch/distributed/communicator.py (2)

tensorrt_llm/mapping.py (3)

Mapping (351-515)

rank (187-188)

rank (191-198)

tensorrt_llm/llmapi/llm_args.py (2)

world_size (459-460)

world_size (469-473)

tests/integration/defs/disaggregated/test_disaggregated.py (2)

tests/integration/defs/conftest.py (4)

disaggregated_test_root (2618-2623)

disaggregated_example_root (285-290)

llm_venv (702-719)

deepseek_v3_model_root (616-631)

tests/integration/defs/local_venv.py (1)

get_working_directory (43-49)

cpp/tensorrt_llm/kernels/mlaKernels.cu (1)

cpp/tensorrt_llm/kernels/mlaKernels.h (2)

helix_position_offsets (109-134)

helix_is_inactive_rank (112-113)

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)

tensorrt_llm/_torch/distributed/communicator.py (5)

tp_size (64-65)

has_pp (52-53)

cp_size (56-57)

rank (40-41)

rank (457-458)

tensorrt_llm/mapping.py (3)

has_pp (258-259)

rank (187-188)

rank (191-198)

tensorrt_llm/mapping.py (1)

tensorrt_llm/_torch/distributed/communicator.py (2)

cp_size (56-57)

cp_config (108-109)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)

tensorrt_llm/runtime/model_runner.py (1)

mapping (824-825)

tensorrt_llm/_torch/distributed/communicator.py (3)

has_cp_helix (104-105)

cp_rank (68-69)

cp_size (56-57)

tensorrt_llm/mapping.py (2)

has_cp_helix (233-235)

cp_rank (534-535)

tensorrt_llm/_torch/device_mesh.py (1)

cp_rank (84-86)

tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)

tensorrt_llm/_torch/distributed/communicator.py (8)

cp_size (56-57)

cp_rank (68-69)

tp_size (64-65)

world_size (44-45)

rank (40-41)

rank (457-458)

cp_config (108-109)

pp_size (60-61)

tensorrt_llm/mapping.py (4)

cp_rank (534-535)

Mapping (351-515)

rank (187-188)

rank (191-198)

tensorrt_llm/_torch/model_config.py (1)

ModelConfig (75-616)

tensorrt_llm/_torch/pyexecutor/model_engine.py (4)

tensorrt_llm/_torch/pyexecutor/llm_request.py (5)

LlmRequest (437-663)

append (101-127)

append (195-212)

cached_tokens (569-570)

cached_tokens (573-576)

tensorrt_llm/mapping.py (3)

CpType (24-32)

has_cp_helix (233-235)

cp_rank (534-535)

tensorrt_llm/_torch/distributed/communicator.py (3)

cp_size (56-57)

has_cp_helix (104-105)

cp_rank (68-69)

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

is_warmup (344-345)

is_warmup (348-353)

tensorrt_llm/commands/serve.py (3)

tensorrt_llm/runtime/model_runner.py (1)

mapping (824-825)

tensorrt_llm/mapping.py (1)

CpType (24-32)

tensorrt_llm/_torch/distributed/communicator.py (4)

cp_config (108-109)

tp_size (64-65)

pp_size (60-61)

cp_size (56-57)

examples/llm-api/quickstart_advanced.py (1)

tensorrt_llm/_torch/distributed/communicator.py (1)

cp_size (56-57)

tensorrt_llm/_torch/attention_backend/trtllm.py (3)

cpp/tensorrt_llm/kernels/mlaKernels.h (2)

helix_position_offsets (109-134)

helix_is_inactive_rank (112-113)

tensorrt_llm/_torch/attention_backend/flashinfer.py (1)

cached_token_lens (116-118)

tensorrt_llm/_torch/attention_backend/interface.py (2)

seq_lens_kv (226-227)

seq_lens_kv (230-237)

tensorrt_llm/_torch/modules/attention.py (3)

cpp/tensorrt_llm/kernels/helixKernels.h (1)

tensorrt_llm (26-46)

tensorrt_llm/mapping.py (1)

CpType (24-32)

cpp/tensorrt_llm/kernels/mlaKernels.h (1)

helix_position_offsets (109-134)

🪛 Ruff (0.14.5)

tensorrt_llm/_torch/distributed/communicator.py

349-349: f-string without any placeholders

Remove extraneous f prefix

(F541)

367-367: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/disaggregated/test_disaggregated.py

1919-1919: Unused function argument: disaggregated_test_root

(ARG001)

tensorrt_llm/_torch/models/modeling_deepseekv3.py

1561-1561: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

1588-1588: f-string without any placeholders

Remove extraneous f prefix

(F541)

1660-1660: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/commands/serve.py

5-5: Redefinition of unused gc from line 2

Remove definition: gc

(F811)

130-131: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

130-131: Avoid specifying long messages outside the exception class

(TRY003)
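
For the B904 and TRY003 findings above, a generic sketch of the pattern Ruff asks for is shown below. It is not the code in serve.py: `InvalidPortError` and `parse_port` are hypothetical names used only to show exception chaining with `raise ... from err` and a message kept inside the exception class.

```python
class InvalidPortError(ValueError):
    """Carries its own message, so call sites stay short (TRY003)."""

    def __init__(self, value: str) -> None:
        super().__init__(f"invalid port: {value!r}")


def parse_port(value: str) -> int:
    try:
        return int(value)
    except ValueError as err:
        # "from err" records the original error as __cause__ (B904).
        raise InvalidPortError(value) from err
```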

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (8)

tensorrt_llm/llmapi/disagg_utils.py (1)

199-201: Including PP and CP in instance_num_ranks is consistent with mapping/world-size logic

Using tensor_parallel_size * pipeline_parallel_size * context_parallel_size makes the per-instance rank count match the full mapping used later in split_world_comm, while preserving backward compatibility when PP/CP are omitted (default 1). Looks good.
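
The rank count described above can be sketched as a small function. The config keys match the names in the comment; the function name and the dict-based config are assumptions for illustration and may differ from the code in disagg_utils.py.

```python
def instance_num_ranks(instance_cfg: dict) -> int:
    """Ranks per server instance: TP * PP * CP, each defaulting to 1."""
    tp = instance_cfg.get("tensor_parallel_size", 1)
    pp = instance_cfg.get("pipeline_parallel_size", 1)
    cp = instance_cfg.get("context_parallel_size", 1)
    return tp * pp * cp


# Context server with TP=2; generation server with TP=1 and CP=2 (Helix).
ctx_ranks = instance_num_ranks({"tensor_parallel_size": 2})
gen_ranks = instance_num_ranks({"tensor_parallel_size": 1,
                                "context_parallel_size": 2})
```

Configs that omit PP and CP still resolve to the TP size alone, which is the backward-compatible behavior the comment refers to.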

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1)

1-32: Helix disaggregated gen config is internally consistent

Context and generation sections use TP/CP sizes in a way that matches the updated disaggregation logic (context_parallel_size only on the generation side for Helix decode-only). No issues spotted.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

441-513: py_helix_is_inactive_rank flag wiring is consistent

Initializing self.py_helix_is_inactive_rank = False alongside other py_* fields and relying on create_child_request’s py_ copying is exactly what the resource manager needs for per-request Helix inactivity tracking. No changes requested.
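
As a rough illustration of what this flag represents, the sketch below follows the PR description: only the last CP rank is currently the active Helix rank, and only that rank appends the new token's KV entry. Both function names are hypothetical and are not the resource manager's API.

```python
def helix_is_inactive_rank(cp_rank: int, cp_size: int) -> bool:
    """Per the PR description, only the last CP rank is active for now."""
    return cp_rank != cp_size - 1


def kv_lens_after_decode_step(kv_lens: list[int]) -> list[int]:
    """Append one KV entry on the active rank only; others stay unchanged."""
    cp_size = len(kv_lens)
    return [
        length + (0 if helix_is_inactive_rank(rank, cp_size) else 1)
        for rank, length in enumerate(kv_lens)
    ]


# ISL 7 split as 4 tokens on cpRank0 and 3 on cpRank1, then one decode step.
after_step = kv_lens_after_decode_step([4, 3])
```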

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

316-320: CP-aware attachment of Python-only request metadata looks good

Extending the condition to (tp_size > 1 or has_pp or cp_size > 1) ensures CP-only topologies also receive Python-side attributes on non-root ranks, without affecting single-rank setups. This is the right direction.
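
The extended condition can be read as a simple predicate. The function name below is hypothetical; only the boolean expression comes from the comment above.

```python
def should_attach_py_metadata(tp_size: int, has_pp: bool, cp_size: int) -> bool:
    """True whenever more than one rank needs the Python-side attributes."""
    return tp_size > 1 or has_pp or cp_size > 1


# CP-only topology (TP=1, no PP, CP=2) now qualifies.
cp_only = should_attach_py_metadata(tp_size=1, has_pp=False, cp_size=2)
single_rank = should_attach_py_metadata(tp_size=1, has_pp=False, cp_size=1)
```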

cpp/tensorrt_llm/thop/attentionOp.cpp (1)

181-238: No changes needed; concerns are not supported by the codebase

This review comment identifies issues that do not exist in the actual code:

Backward compatibility: All current Python call sites already pass exactly 2 tensors in mla_tensor_params (lines 481 and 1771 in tensorrt_llm/_torch/attention_backend/trtllm.py). The size() == 2 check therefore enforces the existing calling pattern and is not a breaking change.

Uninitialized helix fields: The MlaParams struct explicitly initializes both helix_position_offsets{nullptr} and helix_is_inactive_rank{nullptr} as member initializers, ensuring they are always zero-initialized regardless of constructor form. The suggested MlaParams<T> mla_params{} is redundant.

API design: The required parameter without defaults aligns with TensorRT-LLM's intentional design for consistency in tensor parameter collections (documented in learnings from PR 6904).

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/kernels/mlaKernels.h (1)

108-112: Helix inactivity flag plumbed cleanly into MLA params

The new helix_is_inactive_rank pointer mirrors the existing helix_position_offsets pattern and keeps the struct’s nullable‑by‑default contract; no issues spotted here.

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1)

45-70: Helix inactivity flag is correctly threaded through MLA rope generation

The new helix_is_inactive_rank pointer is plumbed cleanly from mla_tensor_params into MlaRopeGenArgs and then into MlaParams, with a strict but reasonable contract (mla_tensor_params.size() == 2). Optional handling via has_value() ? data_ptr : nullptr matches the existing pattern for helix_position_offsets. No functional issues spotted.

Also applies to: 88-111, 139-168, 277-283

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

545-566: Warmup gating by cp_type now correctly excludes only ULYSSES/STAR

The updated warmup logic:
cp_type = self.mapping.cp_config.get('cp_type', None)
if cp_type is not None:
    if cp_type in [CpType.ULYSSES, CpType.STAR]:
        return
means HELIX (and other non‑ULYSSES/STAR cp types) still run warmup, which is what you want for Helix CUDA graph / torch.compile specialization. This looks consistent with the new Helix integration and doesn’t affect non‑CP runs.
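
The gating above can be exercised in isolation with a stand-in enum. The `CpType` below is a local stub with assumed members, not the class from tensorrt_llm/mapping.py, and `skips_warmup` is a hypothetical helper that returns a boolean where the quoted code returns early.

```python
from enum import Enum, auto


class CpType(Enum):
    # Stand-in for tensorrt_llm.mapping.CpType; the member set is assumed.
    ULYSSES = auto()
    STAR = auto()
    HELIX = auto()


def skips_warmup(cp_config: dict) -> bool:
    """Mirror the quoted check: only ULYSSES and STAR skip warmup."""
    cp_type = cp_config.get("cp_type", None)
    if cp_type is not None:
        if cp_type in [CpType.ULYSSES, CpType.STAR]:
            return True
    return False


helix_skips = skips_warmup({"cp_type": CpType.HELIX})
```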

chuangz0

looks good to me for disagg part

brb-nv · 2025-11-26T21:20:05Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-26T21:25:29Z

PR_Github #25890 [ run ] triggered by Bot. Commit: 26f4d07

tensorrt-cicd · 2025-11-26T21:25:31Z

PR_Github #25881 [ run ] completed with state ABORTED. Commit: 28e50c0
LLM/main/L0_MergeRequest_PR #19626 (Blue Ocean) completed with status: ABORTED

QiJune

LGTM

brb-nv · 2025-11-27T02:09:30Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-27T02:14:23Z

PR_Github #25890 [ run ] completed with state SUCCESS. Commit: 26f4d07
/LLM/main/L0_MergeRequest_PR pipeline #19634 completed with status: 'FAILURE'

tensorrt-cicd · 2025-11-27T02:14:54Z

PR_Github #25932 [ run ] triggered by Bot. Commit: 3642531

nvpohanh

Approved because this PR doesn't touch files that are owned by kv-cache manager devs: https://github.com/NVIDIA/TensorRT-LLM/blob/main/.github/CODEOWNERS

tensorrt-cicd · 2025-11-27T15:01:53Z

PR_Github #25932 [ run ] completed with state SUCCESS. Commit: 3642531
/LLM/main/L0_MergeRequest_PR pipeline #19663 completed with status: 'FAILURE'

brb-nv · 2025-11-27T15:03:39Z

/bot run --disable-fail-fast --only-multi-gpu-test

brb-nv · 2025-11-27T15:05:23Z

/bot run --disable-fail-fast --only-multi-gpu-test

tensorrt-cicd · 2025-11-27T15:09:25Z

PR_Github #26043 [ run ] triggered by Bot. Commit: b89b5c5

tensorrt-cicd · 2025-11-27T15:10:55Z

PR_Github #26044 [ run ] triggered by Bot. Commit: b89b5c5

tensorrt-cicd · 2025-11-27T15:10:57Z

PR_Github #26043 [ run ] completed with state ABORTED. Commit: b89b5c5

Signed-off-by: Balaram Buddharaju <[email protected]>

brb-nv · 2025-11-27T16:01:56Z

/bot run --disable-fail-fast --only-multi-gpu-test

tensorrt-cicd · 2025-11-27T16:07:17Z

PR_Github #26051 [ run ] triggered by Bot. Commit: 83d6416

tensorrt-cicd · 2025-11-27T16:07:19Z

PR_Github #26044 [ run ] completed with state ABORTED. Commit: b89b5c5
LLM/main/L0_MergeRequest_PR #19768 (Blue Ocean) completed with status: ABORTED

brb-nv changed the title from "User/brb/integrate helix on main redo mr" to "[None][feat] Integrate helix parallelism" on Nov 20, 2025

brb-nv mentioned this pull request Nov 20, 2025

[None][feat] Integrate helix on main #8894

Closed

1 task

brb-nv commented Nov 21, 2025

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (outdated)

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 2 times, most recently from 812dfb9 to 50436a1 Compare November 21, 2025 17:43

brb-nv marked this pull request as ready for review November 21, 2025 17:51

brb-nv requested review from a team as code owners November 21, 2025 17:51

brb-nv requested review from MatthiasKohl, Shixiaowei02, hlu1, laikhtewari and syuoni November 21, 2025 17:51

coderabbitai bot reviewed Nov 21, 2025

tensorrt_llm/_torch/attention_backend/trtllm.py

tensorrt_llm/_torch/distributed/communicator.py (outdated)

tensorrt_llm/_torch/distributed/communicator.py (outdated)

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from 6f7ffc7 to ec20a04 Compare November 22, 2025 04:04

brb-nv requested a review from a team as a code owner November 23, 2025 01:17

brb-nv requested a review from chuangz0 November 23, 2025 01:17

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 4 times, most recently from ec9faa5 to 7eabb38 Compare November 23, 2025 03:50

syuoni reviewed Nov 25, 2025

chuangz0 approved these changes Nov 25, 2025

brb-nv requested a review from MatthiasKohl November 26, 2025 21:20

QiJune approved these changes Nov 27, 2025

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 2 times, most recently from 0ef0bc2 to 3642531 Compare November 27, 2025 02:08

nvpohanh approved these changes Nov 27, 2025

brb-nv enabled auto-merge (squash) November 27, 2025 02:21

lowsfer approved these changes Nov 27, 2025

brb-nv requested review from a team as code owners November 27, 2025 14:57

brb-nv requested review from mlefeb01 and zeroepoch November 27, 2025 14:57

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from feb598c to b89b5c5 Compare November 27, 2025 15:04

chzblych approved these changes Nov 27, 2025

[TRTLLM-5971][feat] Integrate Helix Parallelism

83d6416

Signed-off-by: Balaram Buddharaju <[email protected]>

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from b89b5c5 to 83d6416 Compare November 27, 2025 16:01
